Recent work has shown the benefits of synthetic data for use in computer vision, with applications ranging from autonomous driving to face landmark detection and reconstruction. Synthetic data offers a number of benefits, from privacy preservation and bias elimination to quality and feasibility of annotation. Generating human-centered synthetic data is a particular challenge in terms of realism and domain gap, though recent work has shown that effective machine learning models can be trained using synthetic face data alone. We show that this can be extended to include the full body by building on the pipeline of Wood et al. to generate synthetic images of humans in their entirety, with ground-truth annotations for computer vision applications. In this report we describe how we construct a parametric model of the face and body, including articulated hands; our rendering pipeline to generate realistic images of humans based on this body model; an approach for training DNNs to regress a dense set of landmarks covering the entire body; and a method for fitting our body model to dense landmarks predicted from multiple views.
Our experience of the world is multimodal: we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.
This paper presents a 3D generative model that uses diffusion models to automatically generate 3D digital avatars represented as neural radiance fields. A significant challenge in generating such avatars is that the memory and processing costs in 3D are prohibitive for producing the rich details required for high-quality avatars. To tackle this problem we propose the roll-out diffusion network (Rodin), which represents a neural radiance field as multiple 2D feature maps and rolls out these maps into a single 2D feature plane within which we perform 3D-aware diffusion. The Rodin model brings the much-needed computational efficiency while preserving the integrity of diffusion in 3D by using 3D-aware convolution that attends to projected features in the 2D feature plane according to their original relationship in 3D. We also use latent conditioning to orchestrate the feature generation for global coherence, leading to high-fidelity avatars and enabling their semantic editing based on text prompts. Finally, we use hierarchical synthesis to further enhance details. The 3D avatars generated by our model compare favorably with those produced by existing generative techniques. We can generate highly detailed avatars with realistic hairstyles and facial hair like beards. We also demonstrate 3D avatar generation from image or text as well as text-guided editability.
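The use of cameras and computational algorithms for non-invasive, low-cost, and scalable measurement of physiological (e.g., cardiac and pulmonary) vital signs is very attractive. However, diverse data representing a range of environments, body motions, illumination conditions, and physiological states is laborious, time-consuming, and expensive to obtain. Synthetic data has proven a valuable tool in several areas of machine learning, yet it is not widely used for camera-based measurement of physiological states. Synthetic data offers "perfect" labels (e.g., without noise and with precise synchronization), labels that may not be possible to obtain otherwise (e.g., precise pixel-level segmentation maps), and a high degree of control over variation and diversity in the dataset. We present SCAMPS, a dataset of synthetics containing 2,800 videos (1.68M frames) with aligned cardiac and respiratory signals and facial action intensities. The RGB frames are provided alongside segmentation maps. We provide precise descriptive statistics about the underlying waveforms, including inter-beat interval, heart rate variability, and pulse arrival time. Finally, we present baseline results from training on these synthetic data and testing on real-world datasets to illustrate generalizability.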
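As a rough illustration of two of the waveform statistics named above, the sketch below derives inter-beat intervals and one common heart rate variability measure (RMSSD) from a list of beat timestamps. The function names, the choice of RMSSD, and the sample timestamps are assumptions made for illustration; they are not taken from the SCAMPS dataset or its tooling.

```python
import math


def inter_beat_intervals(beat_times_s):
    """Return the time differences (seconds) between consecutive beats."""
    return [b - a for a, b in zip(beat_times_s, beat_times_s[1:])]


def mean_heart_rate_bpm(ibis_s):
    """Mean heart rate in beats per minute from inter-beat intervals."""
    return 60.0 / (sum(ibis_s) / len(ibis_s))


def rmssd_ms(ibis_s):
    """Root mean square of successive IBI differences, in milliseconds.

    RMSSD is one of several HRV measures; it is used here only as an example.
    """
    diffs_ms = [(b - a) * 1000.0 for a, b in zip(ibis_s, ibis_s[1:])]
    return math.sqrt(sum(d * d for d in diffs_ms) / len(diffs_ms))


# Hypothetical beat timestamps in seconds (not real dataset values).
beats = [0.0, 0.8, 1.7, 2.5, 3.4]
ibis = inter_beat_intervals(beats)
heart_rate = mean_heart_rate_bpm(ibis)
hrv = rmssd_ms(ibis)
```

With synthetic data the beat times come directly from the generating waveform, so statistics like these can be computed without the peak-detection noise that affects real recordings.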
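Landmarks often play a key role in face analysis, but many aspects of identity or expression cannot be represented by sparse landmarks alone. Thus, in order to reconstruct faces more accurately, landmarks are often combined with additional signals like depth images or techniques like differentiable rendering. Can we keep things simple by just using more landmarks? In answer, we present the first method that accurately predicts 10x as many landmarks as usual, covering the whole head, including the eyes and teeth. This is accomplished using synthetic training data, which guarantees perfect landmark annotations. By fitting a morphable model to these dense landmarks, we achieve state-of-the-art results for monocular 3D face reconstruction in the wild. We show that dense landmarks are an ideal signal for integrating face shape information across frames by demonstrating accurate and expressive facial performance capture in both monocular and multi-view scenarios. This approach is also highly efficient: we can predict dense landmarks and fit our 3D face model at over 150 FPS on a single CPU thread. Please see our website: https://microsoft.github.io/denselandmarks/.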